ABSTRACT

Determining accurate redshift distributions for very large samples of objects has be- 
come increasingly important in cosmology. We investigate the impact of extending 
cross-correlation based redshift distribution recovery methods to include small scale 
clustering information. The major concern in such work is the ability to disentangle the 
amplitude of the underlying redshift distribution from the influence of evolving galaxy 
bias. Using multiple simulations covering a variety of galaxy bias evolution scenarios, 
we demonstrate reliable redshift recoveries using linear clustering assumptions well 
into the non-linear regime for redshift distributions of narrow redshift width. Includ- 
ing information from intermediate physical scales balances the increased information 
available from clustering and the residual bias incurred from relaxing of linear con- 
straints. We discuss how breaking a broad sample into tomographic bins can improve 
estimates of the redshift distribution, and present a simple bias removal technique 
using clustering information from the spectroscopic sample alone. 

Key words: large-scale structure of the Universe — cosmology: observations — 
methods: data analysis — methods: statistical 
1 INTRODUCTION

Cosmological measurements require distance estimates in 
order to map the large scale structure of the universe. In 
the past this has most often been done on an object by 
object basis by obtaining spectroscopic redshifts for individual
sources. Surveys of large samples of galaxies such as
the Sloan Digital Sky Survey and its extensions (York et al.
2000; Abazajian et al. 2009) have been instrumental in improving
cosmological measurements. However, a number of
current and upcoming missions (DES, LSST, etc.) will attempt
to measure the fundamental properties of cosmology,
and particularly dark energy, using a variety of methods 
(e.g. weak gravitational lensing, baryon acoustic oscillations, 
etc.). Fundamental to all of these surveys is the assumption 
that the millions, or even billions, of galaxies observed by 
these instruments will be separable into redshift bins, de- 
spite the fact that the number of objects involved makes 
spectroscopic follow-up wildly impractical. Photometric redshift
techniques show a good deal of promise towards this
goal (e.g. Connolly et al. 1995; Benítez 2000; Cunha et al.
2009), but there remain questions as to whether or not they
can meet the stringent requirements outlined in these surveys
and avoid systematic biases that could leak into dark
energy constraints (e.g. Ma et al. 2006; Cunha et al. 2013).
In this paper, we examine a technique that uses cluster- 
ing between spectroscopic and photometric samples to ac- 
curately determine a photometric sample's redshift distri- 
bution. The applications of such a technique are much more 
general than the aforementioned large surveys: this tech- 
nique can be used to estimate the redshift distribution of 
nearly any data set. Even single-band detections that lack 
photometric redshift estimates can be used, as long as they 
have reliable astrometric information for the calculation of 
cross-correlation functions. 

The technique described in this paper uses the physical
associations due to large scale clustering to probe redshift
distributions. Such ideas are not new: Seldner & Peebles
(1979) cross-correlated quasars and galaxy counts to test
for physical association, though they found no trend
with redshift. Roberts & Odell (1979) similarly cross-correlated
quasars and rich galaxy clusters. More recently,
Quadri & Williams (2010) counted pairs of galaxies at small
angular separations between photometric redshift selected
samples, taking advantage of physically associated pairs
of galaxies in order to determine an empirical measure of
the photometric redshift errors. Similarly, Benjamin et al.
cross-correlated photometric redshift bins to determine
the relative contamination fraction between redshift
bins based on the magnitude of the induced angular cross-correlations.

These previous techniques do not require any spectroscopic
sample and rely solely on photometric redshift
information. Schneider et al. (2006) discuss using cross-correlations
of objects sorted into redshift bins in order
to determine their redshift distribution. They mention that
having a subset of objects with more accurately determined
redshifts would enable tighter constraints than photometric
redshifts alone. Expanding on this idea, Newman (2008)
(hereafter N08) and Matthews & Newman (2010, 2012) describe
a technique that requires a spectroscopic sample that
spans the redshift range of interest. In simple terms, the 
method measures the amount of overlap between the spec- 
troscopic sample divided into redshift bins and an unknown 
sample (which we will refer to as the "photometric" sample, 
though photometric redshifts are not necessary for sample 
selection). As galaxies cluster on all scales, if a spectroscopic 
bin overlaps in redshift with the photometric sample, we ex- 
pect to see an excess number of objects, whereas if there is 
no overlap we expect to simply see the average number of ob- 
jects that overlap spatially due to projection. By measuring 
the strength of the spatial cross-correlations as a function 
of redshift we can recover the redshift distribution of the 
photometric sample. A major component of the N08 technique
is an iterative method to correct for bias evolution
that may occur in the sample. Schulz (2010) implemented a
very similar technique on mock data and reported difficulty
in disentangling the galaxy bias from the redshift distribution,
a point which we will examine in this paper. It is
only very recently that this technique has been used with
real data (Mitchell-Wynne et al. 2012; Nikoloudakis et al.
2012), including the exact technique described in this work
(Morrison et al. 2012; Menard et al., in preparation).

The N08 technique is designed to work with large scale 
correlations where the galaxy bias can be treated as lin- 
ear. However, the increasing amount of power in galaxy cor- 
relation functions due to large scale clustering means that 
there is considerable signal that is not being fully utilized at 
smaller scales (this is particularly relevant given the small 
angular extent of many deep spectroscopic surveys). Further 
study of the spectroscopic sample's non-linear bias proper- 
ties may enable us to account for the effects of bias in the 
redshift recovery. In this paper we explore the impact of 
retaining the linear assumptions while expanding the pro- 
cedure to include smaller physical scales, where the galaxy 
bias becomes non-linear in the density field. In this regime 
evolution of the galaxy bias will modulate the amplitude 
of the recovered redshift distribution. We test the efficacy of 
using the linear assumptions well into the non-linear regime.

Additionally, in some instances we are only interested 
in the existence or absence of galaxies in a redshift interval, 
and the detailed shape of the redshift distribution is not the 
main concern. As an example, when selecting objects based 
on photometric redshifts, parameter degeneracies can lead 
to inclusion of a secondary population of objects far outside 
the intended redshift range. In such a case the exact shape 
of the redshift distribution, which can be distorted by the
presence of evolving galaxy bias, may be secondary to detecting
the presence or absence of an interloper population.
For these reasons we examine the relative amount of infor- 
mation contained at a range of physical scales around our 
spectroscopic samples. 

The layout of the paper is as follows: in §2 we discuss
the algorithms used to determine the redshift distributions.
A summary of the mock datasets is given in §3. Results are
presented in §4. We conclude and present future work in §5.



2 METHOD 

Our main goal is to measure the redshift probability density
function (pdf), $\phi(z)$, for a specific sample of objects that we
will refer to as the "photometric" sample. As mentioned in
Section 1, we do not necessarily need a photometric redshift
measurement for our samples. However, as we will discuss, 
a sample that is selected to cover a narrow range in redshift 
leads to a more accurate recovery than that of a broad dis- 
tribution. So, while the method can be applied to almost 
any arbitrary data set, in practice it will most often use 
samples selected with photometric redshifts. We estimate 
these redshift distributions by measuring the amplitude of 
the cross-correlation signal between our photometric sample 
and a sample of objects with known redshifts. 

The angular cross-correlation between the photometric
sample and spectroscopic sample, $w_{sp}(\theta, z)$, is defined in
terms of the mean density of objects an angular distance $\theta$
from objects in the spectroscopic sample:

$$\langle \Sigma(\theta, z) \rangle = \Sigma_p \left[ 1 + w_{sp}(\theta, z) \right] \qquad (1)$$

where $\Sigma_p$ is the mean surface density of photometric objects.
In practice, rather than measuring the correlation
function in multiple angular bins and fitting a power law
form, we measure the density of "photometric" sources in a
single physical annulus around each individual spectroscopic
source, from a minimum radius ($r_{min}$) to a maximum radius
($r_{max}$), measured in units of comoving kpc. We subdivide the
spectroscopic sample into bins of redshift and measure the 
mean (over)density of "photometric" objects around each 
spectroscopic source within each redshift bin. After sub- 
traction of the average density expected from points ran- 
domly placed in the survey geometry and normalization, 
this is equivalent to the Natural Estimator, $DD/RR - 1$
(Kerscher et al. 2000), where $DD$ is the number of cross-correlated
pairs within our annulus and $RR$ is the number
of correlated pairs from a dataset with randomized positions
in the same survey footprint. We use the amplitude of this
"one bin" estimate of the excess clustering as our estimator
of $\phi(z)$. In addition to calculating the density with uniform
weight within the annulus, we also compute a density
measure where we weight each object proportional to the 
inverse of the spatial distance from the spectroscopic ob- 
ject. We will compare these estimators in the Appendix. All 
calculations in the body of the paper will use the inverse 
weighted estimator. 
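
As an illustration only, the "one bin" estimator described above can be sketched in a few lines of Python. This is not the astro-stomp based code used for the measurements in this paper: the function name `one_bin_overdensity`, the flat-sky projected coordinates, and the brute-force pair counting are all assumptions made for the sketch, and survey masks and edge effects are ignored.

```python
import numpy as np


def one_bin_overdensity(spec_xy, phot_xy, rand_xy, r_min, r_max,
                        inverse_weight=True):
    """Single-annulus DD/RR - 1 around spectroscopic objects.

    Positions are (n, 2) arrays of projected coordinates in the same
    units as r_min and r_max (hypothetical flat-sky setup). r_min must
    be > 0 when inverse_weight is True.
    """
    spec_xy = np.asarray(spec_xy, dtype=float)

    def weighted_pairs(points):
        points = np.asarray(points, dtype=float)
        # Pairwise separations, shape (n_spec, n_points).
        d = np.hypot(spec_xy[:, None, 0] - points[None, :, 0],
                     spec_xy[:, None, 1] - points[None, :, 1])
        in_annulus = (d >= r_min) & (d < r_max)
        if inverse_weight:
            # Weight each pair by the inverse of its separation.
            safe_d = np.where(in_annulus, d, 1.0)
            return np.where(in_annulus, 1.0 / safe_d, 0.0).sum()
        return float(in_annulus.sum())

    # Normalize each pair sum by catalog size so that the ratio compares
    # densities rather than raw counts.
    dd = weighted_pairs(phot_xy) / len(phot_xy)
    rr = weighted_pairs(rand_xy) / len(rand_xy)
    return dd / rr - 1.0
```

Calling the function once per spectroscopic redshift bin, with the annulus converted to the appropriate angular size for that bin, gives one overdensity amplitude per bin.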

In computing the projected overdensities, proper treat- 
ment of the survey area, including complicated selection and 
masks, is essential. To accomplish this goal we develop code 
which employs the astro-stomp software package. The software
uses a pixelization of the sky to encode both the galaxy
positions as well as the survey footprint for fast computa- 
tion of galaxy density and has the ability to encode complex 
masking and selection. 

2.1 Bias Correction 

Measuring the overdensity of galaxies around objects in the 
spectroscopic sample does not immediately give us the un- 
derlying redshift distribution: cross-correlations measure the 
object overdensity within a fixed real space annulus, but the 
clustering length of both the spectroscopic and "photomet- 
ric" samples are not necessarily constant with redshift. We 
must account for any such evolution in order to recover the 
redshift distribution. We will examine using the clustering 
length calculated from the spectroscopic sample alone to account
for bias evolution in Section 4.4, but will mainly use
the iterative method introduced by N08. This technique de- 
scribes an iterative correction to the redshift distribution 
using estimates of the mean clustering length for the photo- 
metric sample and the (presumably known) clustering evolu- 
tion of the spectroscopic sample. Thus, the method requires 
that we have sufficient data to calculate the clustering evolu- 
tion of the spectroscopic sample, and assumes that the bias 
in the photometric sample varies linearly with the bias in 
the spectroscopic sample. This assumption will break down 
as we include measurements at small radii as the clustering
moves into the non-linear regime. Finding the scales
at which the bias evolution becomes too great for effective
correction is the subject of this paper.

As in N08 and Matthews & Newman (2010) (see these
references for the full derivation), we have a relation between
the cross-correlation function of the spectroscopic and photometric
samples and the normalized redshift distribution,
$\phi(z)$, of the sample that we are attempting to estimate, given
by:

$$w_{sp}(\theta, z) = \frac{\phi(z)\, H(\gamma_{sp})\, r_{0,sp}^{\gamma_{sp}}\, \theta^{1-\gamma_{sp}}\, D_A^{1-\gamma_{sp}}}{dl/dz} \qquad (2)$$

where $\gamma_{sp}$ is the power law slope of the correlation function,
$D_A$ is the transverse comoving distance,
$H(\gamma) = \Gamma(1/2)\Gamma[(\gamma - 1)/2]/\Gamma(\gamma/2)$ and $l(z)$ is the comoving distance
to redshift $z$. If we assume that the correlation function
is a power law of the form $w_{sp}(\theta) \sim A_{sp}\theta^{1-\gamma_{sp}}$ between
$r_{min}$ and $r_{max}$ then our one bin measurement of the
overdensity is proportional to the amplitude of the cross-correlation
signal, $A_{sp}$. The unknown quantities in Equation 2
are $\phi(z)$ and $r_{0,sp}$. We cannot evaluate $r_{0,sp}$ and $\gamma_{sp}$
as a function of redshift directly, as we do not know the
redshift distribution of the "photometric" sample, though
we can estimate them using other measured quantities. Assuming
that the galaxy bias is linear, we can estimate the
cross-correlation parameters from power law fits to the autocorrelation
functions of the spectroscopic and photometric
samples, $\gamma_{sp} = (\gamma_{ss} + \gamma_{pp})/2$ and $r_{0,sp}^{\gamma_{sp}} = (r_{0,ss}^{\gamma_{ss}} r_{0,pp}^{\gamma_{pp}})^{1/2}$,
where $\gamma_{ss}$ and $r_{0,ss}$ are measured from the projected correlation
function, $w_p(r_p)$, of the spectroscopic sample, and $\gamma_{pp}$
is the measured power law slope of the photometric sample
angular autocorrelation function. Rearranging Equation 2
we find:

$$\phi(z) = A_{sp}(z)\, \frac{dl/dz}{H(\gamma_{sp})\, r_{0,sp}^{\gamma_{sp}}\, D_A^{1-\gamma_{sp}}} \qquad (3)$$



with the estimated $r_{0,sp}$ entering in the denominator, diminishing
the effect of the bias. The updated $\phi(z)$ can now
be used in Equation 2 to obtain an updated value for $r_{0,sp}$,
and the process can be iterated until convergence is reached.
Note that this iterative procedure simply estimates a single,
average value for $r_{0,pp}$ and assumes that $r_{0,sp}$ scales linearly
with $r_{0,ss}$ to improve the redshift recovery. If $r_{0,sp}$ is evolving
in a non-linear fashion with redshift this method will not
correct the redshift distribution appropriately. The shape of 
the redshift distribution also impacts the effectiveness of the 
iterative correction: as we are assuming that the clustering 
length of the photometric sample is proportional to that of 
the spectroscopic sample, the optimal iterative solution will 
work best at the mean redshift of the sample. For a compact 
and peaked redshift distribution the linear bias assumption 
will be a good approximation of the true bias. For a broad, or 
multiply peaked distribution, deviations from the linear ap- 
proximation will become more problematic. If we can break 
the photometric sample into narrow subsets in redshift it is 
possible to mitigate this problem. This will be examined in 
Section 4.2.
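
A single pass of the bias correction in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function names are hypothetical, the inputs $A_{sp}(z)$, $dl/dz$, $D_A$ and the clustering parameters are assumed to be supplied as arrays on a common redshift grid, and the step that re-estimates $r_{0,pp}$ from the photometric autocorrelation using the updated $\phi(z)$ is not shown.

```python
import math

import numpy as np


def H(gamma):
    """H(gamma) = Gamma(1/2) Gamma[(gamma - 1)/2] / Gamma(gamma/2)."""
    return (math.gamma(0.5) * math.gamma((gamma - 1.0) / 2.0)
            / math.gamma(gamma / 2.0))


def cross_clustering(r0_ss, gamma_ss, r0_pp, gamma_pp):
    """Linear-bias estimates of gamma_sp and r0_sp**gamma_sp.

    r0_ss may be an array (one value per redshift bin); r0_pp is the
    single average value assumed for the photometric sample.
    """
    gamma_sp = 0.5 * (gamma_ss + gamma_pp)
    r0_sp_pow = np.sqrt(np.asarray(r0_ss, dtype=float) ** gamma_ss
                        * r0_pp ** gamma_pp)
    return gamma_sp, r0_sp_pow


def phi_from_amplitude(z, A_sp, dl_dz, D_A, r0_sp_pow, gamma_sp):
    """Rearranged cross-correlation relation, normalized to unit integral."""
    phi = A_sp * dl_dz / (H(gamma_sp) * r0_sp_pow * D_A ** (1.0 - gamma_sp))
    # Trapezoid-rule normalization over the redshift grid.
    norm = np.sum(0.5 * (phi[1:] + phi[:-1]) * np.diff(z))
    return phi / norm
```

In an iterative scheme, the output of `phi_from_amplitude` would feed a new estimate of $r_{0,pp}$, which is then passed back through `cross_clustering` until the recovered distribution stops changing.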



2.2 Choice of $r_{min}$ and $r_{max}$

The photometric overdensity is measured over a constant 
range of projected physical scales around each spectroscopic 
object, bounded by an inner and an outer radius, $r_{min}$ and
$r_{max}$. The choice of these radii affects the recovery in a number
of ways and the values for $r_{min}$ and $r_{max}$ serve as the
primary tuning parameters for the fidelity of the recovered 
redshift distribution. 

For cases where the photometric catalog contains some 
fraction of the spectroscopic catalog a complication can 
arise. Excess signal from the cross-matches between objects 
in the catalogs may boost the recovery signal if we allow for 
$r_{min} = 0$, or if astrometric uncertainties lead to mismatch
of spectroscopic and photometric objects. For our simulations,
we have explicitly excluded such objects, so small radius
matches are unaffected by such contamination. More
broadly, at small physical separations (below ~1 Mpc) the
clustering of objects will become stronger and increasingly 
non-linear. This increased amplitude is a strong indicator 
that there is more information to be extracted from the 
cross-correlation, albeit at the cost of bending some of the 
linear assumptions described in §2.1. We will discuss this
issue further in Section 5. For the outer boundary of the annulus,
as $r_{max}$ increases, more physically associated galaxies
will be included in the annulus, but so will an increasing
number of unassociated background sources. The clustering
signal declines as radius increases, so increasing $r_{max}$ can
degrade the signal to noise ratio of the measurement. The
optimum $r_{max}$ will depend on both the clustering of the
sample and the density of the photometric source catalog. 
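
Because the annulus is fixed in comoving kpc, its angular size changes with the redshift of each spectroscopic object. The sketch below shows one way to make that conversion for a flat $\Lambda$CDM cosmology, where the transverse comoving distance equals the line-of-sight comoving distance. The function names, the default parameters (taken from the Millennium values quoted in this paper), and the choice of plain kpc rather than kpc/h are assumptions of this illustration.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance_mpc(z, omega_m=0.25, h=0.73, n_steps=2048):
    """Comoving distance to redshift z > 0 in Mpc, flat LambdaCDM.

    Integrates c / H(z) with the trapezoid rule; radiation is neglected.
    """
    zs = np.linspace(0.0, z, n_steps)
    inv_E = 1.0 / np.sqrt(omega_m * (1.0 + zs) ** 3 + (1.0 - omega_m))
    integral = np.sum(0.5 * (inv_E[1:] + inv_E[:-1]) * np.diff(zs))
    return C_KM_S / (100.0 * h) * integral


def annulus_arcsec(z, r_min_kpc, r_max_kpc, **cosmo):
    """Angular radii (arcsec) of a comoving annulus at redshift z."""
    d_kpc = comoving_distance_mpc(z, **cosmo) * 1000.0
    rad_to_arcsec = np.degrees(1.0) * 3600.0
    return (r_min_kpc / d_kpc * rad_to_arcsec,
            r_max_kpc / d_kpc * rad_to_arcsec)
```

For example, `annulus_arcsec(1.0, 300.0, 3000.0)` gives the angular extent of the largest of the decade-width annuli for a spectroscopic object at $z = 1$.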
3 SIMULATED GALAXY CATALOGS

The redshift recovery procedure is sensitive to the evolu- 
tion of galaxy bias. In order to test this sensitivity we em- 
ploy two sets of simulations: mocks based on Millennium 
light cones with a limited field of view (Springel et al. 2005;
Croton et al. 2006) and larger area mocks based on LasDamas
simulations (McBride et al., in prep). To cover a wide
variety of possible scenarios we will examine four mock data 
sets: 

(i) No galaxy bias evolution. 

(ii) Evolving galaxy bias as expected for a realistic, mag- 
nitude limited sample. 

(iii) A magnitude limited selection with an additional 
stellar mass cut used to recover the magnitude limited sam- 
ple, as might be expected if our spectroscopic catalog was a 
particular galaxy type. 

(iv) A mixed case with constant bias at low redshift and 
a magnitude limited selection at higher redshift used to re- 
cover a sample with smooth bias evolution. Such a distri- 
bution might arise from multiple populations or complex 
selection criteria. 

The first two cases have spectroscopic and photometric data 
drawn from the same underlying distribution, and thus iden- 
tical galaxy bias properties. Cases iii and iv have different 
bias properties for the spectroscopic and observed samples, 
as will be discussed in the following subsections. We use the 
LasDamas simulations for the constant bias (item i) and 
mixed evolution samples (item iv), and the Millennium sim- 
ulations for the magnitude selected and stellar mass selected 
samples (items ii & iii ). 

3.1 LasDamas Based Mock Catalogs 

The LasDamas catalogs used in this paper are a customized 
galaxy data set generated from the dark matter simulations 
of the LasDamas project (McBride et al., in prep). These
galaxy mocks were constructed for testing this method, and 
do not explicitly fit to observed SDSS data, as is done in 
the full LasDamas simulations. This enabled us to extend 
the redshift range to $z > 1$, with samples spanning
$0.03 \le z \le 1.33$, and covering a 9 x 14 degree patch of sky.
The galaxy mocks are constructed from a static redshift output
of one of the four large LasDamas boxes (the Carmen
simulations) at $z = 0.5$. We defined friends-of-friends halos
(Davis et al. 1985) with a linking length of 20% of the mean
inter-particle separation. We assigned mock galaxies based
on a simple 3 parameter halo occupation distribution model
(i.e. HOD; Berlind & Weinberg 2002). To achieve the variable
bias, the HOD is varied to reduce the number density
as a function of redshift (thereby increasing the bias). The
LasDamas simulations assume a flat $\Lambda$CDM cosmology with
$\Omega_m = 0.25$, $\Omega_\Lambda = 0.75$, $h = 0.70$, and $\sigma_8 = 0.8$.

We construct two spectroscopic catalogs, populating the 
same dark matter halo catalogs with two different types of 
galaxy bias applied to generate the data: 

Footnote: We note that these mocks are not part of the "publicly
available" mocks accessible from the LasDamas website,
http://lss.phy.vanderbilt.edu/lasdamas/



• A constant bias for the whole redshift range consisting
of approximately 120,000 galaxies over 125 deg$^2$.

• A constant bias over $0.2 \le z \le 0.77$, then mimicking
an apparent magnitude limited selection for $z > 0.77$ with
about 235,000 galaxies over 500 deg$^2$.

We refer to the first as the "constant bias" sample and the 
second as the "mixed bias evolution" sample. For the photometric
samples we create distributions drawn from the
same constant bias, as well as an additional dataset
with bias chosen such that the density decreases linearly
with redshift over the range $0.2 \le z \le 1.33$ (referred to as
the "linear density evolution" sample). For these samples we
create: 

• A bimodal sample with galaxies in the ranges
$0.4 < z < 0.6$ and $0.8 < z < 1.1$ containing about 350,000 galaxies for
the constant bias sample and about 592,000 galaxies for the
linear density evolution sample.

• A Gaussian centered at $z = 0.75$ with width $\sigma_z = 0.10$
containing ~145,000 galaxies for the constant bias case and ~410,000
galaxies for the linear density evolution sample.

3.2 Millennium Galaxy Mock Catalogs 

The Millennium Simulation galaxy mock catalogs of
Croton et al. (2006) are light cones populated with galaxies
generated from semi-analytic models that follow the prescriptions
of Croton et al. (2006) and Kitzbichler & White
(2007). We use the four 2 x 2 degree "DLS" cones, designed
to match the footprints of the Deep Lens Survey
(Wittman et al. 2002). The light cones contain 17.4 million
galaxies with redshifts spanning $0 < z < 3$ over 16 deg$^2$
with a magnitude limited r-band depth of $r = 29.0$, which
cover a redshift range large enough that significant galaxy
bias evolution will occur. Areal coverage is limited enough
that sample variance will be a significant factor in some
measurements. The Millennium simulation assumes cosmological
parameters $\Omega_m = 0.25$, $\Omega_\Lambda = 0.75$, $h = 0.73$, and
$\sigma_8 = 0.9$.

We construct two "spectroscopic" catalogs with known 
redshifts, and two "photometric" catalogs, where no redshift 
information is retained. For the spectroscopic sets: 

• We randomly select ~2% (approximately 325,000
galaxies) of the magnitude limited sample that will have a 
galaxy bias evolution matching that of the underlying sam- 
ple. 

• We design a galaxy sample of roughly the same size as 
the previous sample (335,000 galaxies) which contains all 
galaxies with stellar mass greater than 2.3 x 10^'' ^Q- 

We refer to the first as the "bias evolution" sample and the 
second as the "masscut" sample. Each has a surface density 
of approximately 5.6 galaxies/arcmin$^2$. For the photometric
samples, we draw from the same underlying magnitude
limited distribution used in the evolving bias scenario, thus 
they have identical bias evolution properties to the evolving 
bias case. We construct two samples from the simulation 
data: 

Footnote: Available at: http://web.me.com/darrencroton/Homepage/SDSS-DEEP2.html
• A bimodal distribution with 1.9 million galaxies covering
the ranges 0.9 < z < 1.1 and 1.9 < z < 2.1.

• A Gaussian distribution of about 690,000 galaxies cen- 
tered at z = 1.5 with width ~ 0.15. 

As will be shown in later sections, despite similar redshift 
spans and identical spectroscopic catalogs, the effects of 
evolving bias on the recovered redshift distribution for these 
two samples can be quite dissimilar. The stellar mass selec- 
tion gives our spectroscopic sample different bias properties 
than the photometric samples, enabling us to test the effi- 
cacy of the recovery algorithm in the presence of stronger, 
non-representative bias. The evolving bias scenario has iden- 
tical bias in the spectroscopic and photometric samples. The 
iterative procedure should perfectly recover the redshift distribution
when the biases are the same; however, we will see
that this is not the case when we measure the clustering 
length based on the large linear regime scales but use non- 
linear clustering information to reconstruct the distribution. 

3.3 Clustering Measurements 

The recovery procedure requires fits to the projected cor- 
relation function of the spectroscopic datasets, as well as 
the slope and amplitude of the two point autocorrelation 
functions of the photometric samples. For the Millennium 
data sets the projected correlation functions of the spectro- 
scopic samples are well fit by a power law form, and lack 
a strong 1-halo break. As the slope of the power law shows 
little variation, we fix it at $\gamma_{ss} = 1.8$ when fitting for the
correlation length, $r_{0,ss}$, and use only projected separations
greater than 300 kpc. Thus, non-linear evolution in the
one halo regime will not be reflected in the clustering lengths
used in the iterative corrections. We fit a parabolic form to
the $r_{0,ss}$ data to smooth small scale redshift dependence
induced by the measurement errors. For the angular correlation
function fits of the photometric samples, we measure
best fit power law slopes of $\gamma_{pp} = 1.75 \pm 0.13$ for the bimodal
and $\gamma_{pp} = 1.68 \pm 0.11$ for the Gaussian distribution.
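
The fixed-slope fit for the correlation length can be illustrated with the standard power law form of the projected correlation function, $w_p(r_p) = r_p (r_0/r_p)^{\gamma} H(\gamma)$. The sketch below is an assumption-laden example rather than the fitting code used here: the function name is hypothetical, the fit is an unweighted least-squares fit in log space (measurement errors and covariances are ignored), and all $w_p$ values are assumed to be positive.

```python
import math

import numpy as np


def fit_r0_fixed_slope(r_p, w_p, gamma=1.8, r_p_min=300.0):
    """Fit r_0 with the slope held fixed, using only r_p > r_p_min.

    r_p and r_p_min are in kpc and w_p is in the same length units,
    so the returned r_0 is also in kpc.
    """
    r_p = np.asarray(r_p, dtype=float)
    w_p = np.asarray(w_p, dtype=float)
    keep = r_p > r_p_min  # drop scales inside the one-halo regime
    h_gamma = (math.gamma(0.5) * math.gamma((gamma - 1.0) / 2.0)
               / math.gamma(gamma / 2.0))
    # log w_p = gamma * log r_0 + (1 - gamma) * log r_p + log H(gamma);
    # with gamma fixed, the least-squares log r_0 is a simple mean.
    log_r0 = np.mean(np.log(w_p[keep]) - math.log(h_gamma)
                     - (1.0 - gamma) * np.log(r_p[keep])) / gamma
    return math.exp(log_r0)
```

Repeating the fit in each spectroscopic redshift bin yields the $r_{0,ss}(z)$ values to which a smooth functional form can then be fit.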

For the LasDamas data, both the constant bias and 
mixed bias data sets show a strong one halo component and 
a break in their projected correlation functions, with slopes 
of $\gamma_{ss} = 1.45$ for the constant bias data set and $\gamma_{ss} = 1.8$
for the mixed bias case, nearly independent of redshift, at
$r_p > 300$ kpc beyond the one-halo break. Once again we fit
for $r_{0,ss}$ using only this "quasi-linear" regime.



4 RESULTS 

The effectiveness of cross-correlation methods in recovering 
redshift distributions is dependent on many factors. Some of 
these (e.g. spectroscopic completeness or galaxy bias evolu- 
tion) are determined by the survey data itself. Others (the 
scale used for the cross-correlations, the redshift binning 
of the spectroscopic samples and weighting of the cross- 
correlation pairs) are nearly free parameters that can be 
used to tune the recovery. Of these free parameters, the 
choice of scale has the most significant effect on the recovery 
due to its direct linkage to the galaxy bias dependence of the 
recovery. In the N08 iterative technique, measurements are 
made on large enough scales (several Mpc) that the linear
bias can be removed. For the purposes of our analysis, we
consider a much broader range of scales, from the linear to
the quasi-linear (~1 Mpc) down to the non-linear scales
of a few kiloparsecs. Our full analysis tests a wide range of
scales, covering $3 \le r_{min} \le 3000$ kpc and $10 \le r_{max} \le 5000$
kpc, but for illustrative purposes we will show results only
for three representative decade-width scales: $3 < r < 30$
kpc, $30 < r < 300$ kpc and $300 < r < 3000$ kpc.

Before examining the redshift recoveries for these scales, 
a word about the smallest scales: Since we are using simu- 
lated data with perfect astrometry, we can distinguish per- 
fectly between galaxies at scales where real data sets with 
noise from astrometric calibration and atmospheric blurring 
would likely struggle. Applying these techniques to real data 
on those scales would likely mean that cross-contamination 
between the spectroscopic and photometric samples would 
dominate the recovered signal. We have experimented with 
differing levels of cross-contamination between our simu- 
lated samples and find that the behavior of the recovered 
distributions is highly dependent on the choice of simulated 
spectroscopic sample. To avoid this additional complication, 
we have chosen to eliminate all spectroscopic objects from 
our photometric catalogs and vice versa and defer further 
exploration of this effect to a future publication. 

We estimate errors on the redshift distributions with
a spatial jackknife. This consists of subdividing the sample
into $N$ contiguous regions of the sky, each with approximately
equal area. We then perform each measurement $N$
times, each time leaving out one region. We then estimate
the jackknife variance as:

$$\mathrm{Var}(x) = \frac{N-1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \qquad (4)$$

where $x_i$ is the measurement for the $i$th region and $\bar{x}$ is the
mean for the entire sample.
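
A minimal sketch of the jackknife variance is given below, assuming the standard leave-one-out form with prefactor $(N-1)/N$. The function name is hypothetical, and as a simplifying assumption the reference value is taken to be the mean of the $N$ leave-one-out measurements rather than a separate full-sample measurement.

```python
import numpy as np


def jackknife_variance(leave_one_out):
    """Jackknife variance from N leave-one-out measurements.

    leave_one_out has shape (N,) or (N, n_bins); entry i is the
    measurement made with region i removed. The variance is returned
    per redshift bin when a 2-D array is supplied.
    """
    x = np.asarray(leave_one_out, dtype=float)
    n = x.shape[0]
    mean = x.mean(axis=0)  # assumed stand-in for the full-sample value
    return (n - 1.0) / n * np.sum((x - mean) ** 2, axis=0)
```

For the simple case where each measurement is a sample mean, this expression reduces to the familiar variance of the mean, which makes a convenient sanity check.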

4.1 Recovery Scales and Populations 

In practical terms, several different bias scenarios may arise
depending on the type of data selected. We might select a
population with very little expected bias evolution (e.g. Luminous
Red Galaxies), slowly evolving bias over a broad
redshift interval (e.g. field galaxies), or complex evolution
due to the presence of multiple populations (e.g. a
tomographic redshift bin with outliers). For this reason we
study the redshift reconstruction in several bias scenarios. 
The shape of the redshift distribution of the photometric 
sample also plays a role: even a sample with strong galaxy 
bias evolution will not show significant relative bias change 
if the redshift interval of the recovery is sufficiently narrow. 
Conversely, even slight bias evolution over a broad redshift 
interval may become significant when we include additional 
information from non-linear scales. We test the recovery al- 
gorithm on two types of distributions to explore these effects: 
a centrally peaked Gaussian distribution and a bimodal dis- 
tribution. 

Figure 1 shows the recovered redshift distributions of
the Gaussian photometric samples for both the LasDamas
constant bias scenario and the Millennium evolving bias at
our three representative scales. Red points show the distribution
before the iterative correction of Equations 2 and 3 is
applied, while black points show the results after the correction.
In the constant bias case the iterative correction should
have almost no effect, which is seen in the small difference
between the pre- and post-iteration recoveries. Interestingly,
the method performs extremely well in the absence of bias
evolution, down to the smallest scales, and including scales
that span the break in the LasDamas correlation function.

Figure 1. Recovered Gaussian redshift distributions for the LasDamas constant bias (left) and Millennium evolving bias (right) spectroscopic
samples for three decade width sets of $r_{min}$ and $r_{max}$. Red points are the results before the iterative bias correction is applied,
while black points with gray errors are after the iteration. The blue histogram shows the actual redshift distribution of the photometric
sample. The more centrally peaked distribution is less sensitive to bias evolution than the broader bimodal distribution.

The centrally peaked Gaussian distribution, with most 
galaxies close to the mean redshift where our $r_{0,pp}$ estimate
is most accurate, shows little sensitivity to effects of the 
evolving bias. More compact and symmetric photometric 
distributions will be less affected by galaxy bias evolution, 
which enables us to push the recovery to smaller scales. We 
will discuss this further in Sections 4.2 and 4.3. For the constant
bias scenario the best fit is $50 \le r \le 100$ kpc with a
reduced $\chi^2 = 1.42$, though the change in $\chi^2$ is not particularly
sensitive to the exact values of $r_{min}$ and $r_{max}$ (i.e. the
likelihood surface is very flat).

For the evolving bias the best fit occurs for $100 \le r \le 300$
kpc with a reduced $\chi^2 = 0.107$. The $\chi^2$ value significantly
below 1.0 shows that we are overestimating our error
bars for the Millennium sample, which is not unexpected: 
with only 16 square degrees available in the Millennium light 
cones we use only 32 jackknife samples to estimate the errors 
on fits for 99 bins. This appears to only affect the amplitude 
of the errors, and not the overall structure of the covariance 
matrix. The centrally peaked distribution shows little sen- 
sitivity to bias evolution even down to the smallest scales 
probed, and we accurately recover the distribution at all 
scales. 
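
The jackknife error estimate referred to above can be sketched as follows. This is a minimal illustration assuming the standard delete-one jackknife estimator; the function names and the input layout (one recovered distribution per left-out region) are hypothetical and are not taken from the analysis itself. The second helper records a general property of sample covariances: a matrix built from $N$ resamplings has rank at most $N - 1$, so 32 jackknife samples cannot constrain every mode of a 99-bin covariance.

```python
def jackknife_covariance(samples):
    """Delete-one jackknife covariance of a binned estimate.

    samples[k][i] is the value recovered in redshift bin i when
    jackknife region k is left out (hypothetical input layout).
    """
    n = len(samples)
    nbins = len(samples[0])
    # Mean of the jackknife resamplings in each bin.
    mean = [sum(s[i] for s in samples) / n for i in range(nbins)]
    cov = [[0.0] * nbins for _ in range(nbins)]
    for s in samples:
        d = [s[i] - mean[i] for i in range(nbins)]
        for i in range(nbins):
            for j in range(nbins):
                cov[i][j] += d[i] * d[j]
    # The (n - 1) / n prefactor is the standard jackknife normalization.
    factor = (n - 1) / n
    return [[factor * c for c in row] for row in cov]


def max_covariance_rank(n_samples, n_bins):
    """Upper bound on the rank of a covariance from n_samples resamplings."""
    return min(n_samples - 1, n_bins)


if __name__ == "__main__":
    toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # illustrative values only
    print(jackknife_covariance(toy))
    print(max_covariance_rank(32, 99))
```

How the rank-deficient matrix was handled in the fits is not specified in the text, so no inversion step is shown here.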



Figure 2 shows the recovered redshift distributions of
the bimodal samples for the constant bias and evolving bias
spectroscopic samples. The best fit for the bimodal sample
occurs at $10 \leq r \leq 50$ kpc with a reduced $\chi^2$ of 1.16. The
$\chi^2$ values are similar for the bimodal and Gaussian
distributions, showing that in the absence of bias evolution the
recovery performs accurately regardless of the shape of the
redshift distribution. The most notable feature in the
evolving bias scenario is the relative amplitude of the two peaks.
The bimodal sample shows a clear bias before the iterative
correction is applied, with larger discrepancies as the
annulus moves to smaller physical scales. This is as expected,
since this bimodal configuration is particularly sensitive to
bias evolution. Because the iterative correction estimates a
single, average value of $r_{0,pp}$ for the sample, the bias
correction is most accurate near the mean redshift of the
photometric distribution. The bimodal sample has a mean redshift
of $z = 1.59$, between the two peaks where no galaxies are
located. Also of note is that even with identical bias
evolution we introduce error into the recovered distributions,
even after iterative correction. This is mainly because
we (purposely) measure the clustering length using only
large ($> 300$ kpc) scales, and to a lesser extent because the
empirical estimation of the clustering length from finite
samples can also introduce errors. We note, however, that
the iterative technique does accurately recover the
distributions when only the large scale clustering information is
used, as expected.

Figure 2. Recovered bimodal redshift distributions for the LasDamas constant bias (left) and Millennium evolving bias (right)
spectroscopic samples for three decade-width sets of $r_{\rm min}$ and $r_{\rm max}$. Red points are the results before the iterative
bias correction is applied, while black points with gray errors are after the iteration. The blue histogram shows the actual redshift
distribution of the photometric sample. In the case of no bias evolution the recovery works well on all scales, while evolving bias
induces a skew in the recovered distribution.

The effectiveness of the iteration in correcting for the
bias is clearly reduced as the radius of the annulus
decreases, though errors due to covariance between bins also
increase as $r_{\rm max}$ grows and more unassociated galaxies are
included in the estimate. The best fit values are for
intermediate scales, with a minimum at $200 \leq r \leq 300$ kpc
and $\chi^2 = 0.334$. The best fits at intermediate scales balance
the increasing influence of the bias at small scales with the
concurrent increase in signal to noise and decreased bin to
bin covariance that comes with smaller physical apertures.
It is clear that small scale information greatly increases bias
in the recovery, and should not be used to recover broad
redshift distributions.

Table 1 lists the reduced $\chi^2$ values for the representative
scale distributions both before and after the N08 iteration is
applied. The success of the iterative technique in aiding the
recovery procedure is varied. The iteration improves the
recovery for every case in the evolving bias scenario, gives mixed
results in the constant bias and mass cut scenarios, and mainly
degrades results in the mixed/linear case.

For another, more quantitative measure of the fidelity
of the redshift recovery, we calculate the sample mean
(average redshift) and standard deviation (square root of the
sample variance, i.e. the "sample width") for each
distribution. Figure 3 shows the deviation from the true redshift
mean $z_{\rm tr}$ and true sample width $\sigma_{\rm tr}$ for the two cases shown
in Figures 1 and 2. We show the three decade-width bins
and also results using information encompassing all three
scales, with $3 \leq r < 3000$ kpc, as a gray shaded ellipse. As
the information in each of the three annuli is independent,
combining all scales should provide a higher signal-to-noise
measurement of the redshift distribution.
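
The two summary statistics used here can be sketched in a few lines. This is an illustration only, assuming a recovered distribution sampled in equal-width redshift bins with a positive total; the function name and the toy bimodal values are hypothetical. The example shows how a skew in the relative peak amplitudes shifts the mean while changing the width comparatively little.

```python
import math


def mean_and_width(z_centers, phi):
    """Weighted mean redshift and width of a binned distribution.

    Assumes equal-width bins, so the bin width cancels, and a positive
    total amplitude (hypothetical helper, not from the source).
    """
    total = sum(phi)
    mean = sum(z * p for z, p in zip(z_centers, phi)) / total
    var = sum(p * (z - mean) ** 2 for z, p in zip(z_centers, phi)) / total
    return mean, math.sqrt(var)


if __name__ == "__main__":
    z = [0.4, 0.8]                      # two idealized peaks
    print(mean_and_width(z, [1.0, 1.0]))  # equal amplitudes
    print(mean_and_width(z, [2.0, 1.0]))  # low-redshift peak overestimated
```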

The top two panels in Figure 3 show that with little or
no expected bias evolution, using all scales works extremely
well at recovering the photometric sample distributions. For
the evolving data set, the presence of even modest bias
evolution results in a misestimation of the mean redshift for the
bimodal distribution, due to the relative amplitudes of the
recovered peaks. However, the sample width is largely
unaffected, as the lack of cross-correlation signal outside of the
two bimodal bins provides a strong constraint on $\sigma$. In the
centrally peaked Gaussian distribution the mean is more
accurately recovered, but the uncertainty in the sample width
is increased. This is because the tails of the
distribution are now more affected by the difference in bias at
low and high redshift.

Unlike the top two rows, the bottom rows of Figure 3 use
simulated spectroscopic samples with very different bias
profiles from their respective photometric samples. The stellar
mass cut sample has stronger bias evolution than the
magnitude limited sample that comprises the observed sample.
The effects on the bimodal distribution in the evolving bias
scenario are exacerbated by the stronger bias in the mass
cut sample. Once again the mean redshift for the bimodal
sample is significantly skewed.

To illustrate the differing bias of the three LasDamas
samples we calculate the linear bias explicitly and show it
in Figure 4. The mixed/linear bias case uses the "mixed"
bias data for the spectroscopic sample and the linear density
Table 1. Reduced $\chi^2$ for distributions before and after the N08 iterative correction.

                          Bimodal                         Gaussian
Annulus            pre-iteration  post-iteration   pre-iteration  post-iteration

LD Constant Bias
3-30 kpc               3.34           3.11             1.55           1.59
30-300 kpc             1.99           1.47             2.12           3.33
300-3000 kpc           1.86           1.64             1.35           2.06
3-3000 kpc             1.97           1.79             1.43           2.61

Evolving Bias
3-30 kpc               1.12           0.67             0.274          0.201
30-300 kpc             0.566          0.258            0.229          0.148
300-3000 kpc           0.400          0.216            0.264          0.215
3-3000 kpc             0.496          0.270            0.237          0.207

Mass Cut
3-30 kpc               7.11           7.20             0.279          0.276
30-300 kpc             1.99           2.04             0.156          0.155
300-3000 kpc           0.381          0.376            0.215          0.204
3-3000 kpc             0.743          0.783            0.514          0.521

LD Mixed Bias
3-30 kpc             228.2          513.1            366.08         122.8
30-300 kpc            10.99          27.90             3.26           4.92
300-3000 kpc           5.98          10.01             2.05           1.78
3-3000 kpc           131.6          340.3             20.71          34.44

sample for the photometric sample. In this case, the normal
tendency of the method to over-estimate signal at lower
redshifts is counteracted by the more rapid bias evolution in the
spectroscopic sample, resulting in a mean recovered redshift
near the expected value for all scales. However, the
difference in bias on the two sides of the Gaussian distribution
results in greatly increased scatter in the recovered width.
Note that we did not have access to this information when
computing the recovered distributions; in fact, the bias is
never explicitly calculated in the N08 iteration. Instead, the
clustering length of the spectroscopic sample, empirically
measured from the correlation functions, is used to
iteratively determine the best value for the photometric sample
clustering length.

Overall, we see several trends: 



• In the absence of bias evolution, recovery S/N is always
highest at the smallest scales.

• In the presence of modest bias evolution, intermediate
scales ($\sim 100 < r < 500$ kpc) offer the most reliable, highest
S/N recovery.

• For extreme bias evolution, larger scales ($1000 < r < 3000$ kpc)
offer the cleanest recovery.

• For all cases, a centrally peaked redshift distribution
is far less sensitive to bias evolution, although outliers can
affect the recovered distribution width.

• Small scale information should not be used to recover
broad redshift distributions when the bias is known to
evolve.



4.2 Tomographic Binning 

The previous section discussed the recovery of broad redshift 
distributions. However, most upcoming surveys will focus 
on determining the redshift distribution for narrow tomo- 
graphic redshift bins for the purposes of measuring weak 
lensing and baryonic acoustic oscillations. 

The main limitation of the iterative method is that it
relies on a single estimated value for $r_{0,pp}$. If the clustering
length of the photometric sample evolves differently than that of the
spectroscopic sample, then the assumption that $r_{0,sp}$ scales
as $r_{0,sp} \sim (r_{0,pp}\, r_{0,ss})^{1/2}$ will not hold. The iterative method
essentially finds the best single value for $r_{0,pp}$ given the data.
However, if we can further subdivide our photometric sample
in redshift, e.g. with some photometric redshift algorithm
or color selection, we benefit in several ways. First, we may
now determine a best fit $r_{0,pp}$ over a smaller redshift range
for each subsample, over which the bias presumably evolves
less. Second, having two values of $r_{0,pp}$ to estimate gives
an additional free parameter. Third, the signal to noise of
the measurement increases, because by breaking our initial
photometric dataset into multiple samples in redshift we
remove a large number of physically unassociated galaxies
from the correlation measurement that were adding to the
background and diluting the signal.

Figure 5 shows the result of splitting the bimodal sample
for the evolving bias dataset (the same as shown in the
top right panel of Figure 2) into two distinct redshift bins
and computing the recovered distributions for each individual
bin. This is done for $r_{\rm min}$ and $r_{\rm max}$ values of 3 kpc
and 30 kpc, far into the non-linear regime and at much
smaller scales than the best fit in the previous section. The
Figure 3. Measured deviation from the true redshift mean and
width of the bimodal (left) and Gaussian (right) distributions for
all four spectroscopic data sets. The truth is shown as the black
dot; results for 3-30 kpc (red), 30-300 kpc (green), 300-3000 kpc (blue),
and 3-3000 kpc (gray shaded) are shown for comparison.

Figure 4. Linear galaxy bias as a function of redshift for the
three LasDamas samples described in Section 3.1. Red indicates
the constant bias sample, green the sample with linear density
evolution, and blue the sample with "mixed" bias evolution.



red points show the recovery for the low redshift bin, while
the blue points show the recovery for the high redshift bin
(overlapping points have been omitted for clarity). The top
panel shows the results of the same reconstruction using a
single photometric sample. The change in the size of the
errors on the two individual recoveries is related to the
normalization factor enabled by the two independent estimates
of the best fit clustering length, which boosts/lowers the
signal-to-noise of the low/high redshift recoveries. In
general, the more narrowly the redshift range can be restricted,
the smaller the optimum recovery scale will be, which results
in an increase in S/N. However, the presence of catastrophic
outliers in certain photometric redshift ranges may be a
concern when applying tomographic selections. We address this
in the following section.

4.3 Redshift Outlier Detection 

Selection of tomographic redshift bins in cosmological
analyses, for instance with color cuts or photometric redshift cuts,
can produce data sets in which degeneracies admit
an unrelated population far outside the intended redshift
range. The most prominent example in optical photometric
surveys is the common Lyman/Balmer break degeneracy,
where low redshift ($z \sim 0.2-0.3$) blue galaxies are mistaken
for very high redshift ($z \sim 2-3$) blue galaxies and vice versa,
due to their similar optical colors. Such "catastrophic
outlier" populations often result in two bimodal peaks widely
separated in redshift, which we have shown (§4.1) can be
problematic for accurate redshift recovery. However, we can
use small scale information to diagnose the presence of such
outlier populations.

Figure 5. Top: Recovered redshift distribution for the bimodal
Millennium light cone sample for an annulus of 3-30 kpc.
Bottom: the same sample split into two redshift bins (overlapping
points omitted for clarity). The bottom panel shows that the
amplitudes of the bimodal recovered low redshift (red) and high
redshift (blue) samples are significantly less biased than the union of
the two samples recovered at once.

In nearly all cases examined previously, reconstructing
the redshift distribution at smaller physical scales results in
smaller uncertainties, albeit at the expense of increased bias
sensitivity. This is expected: given the power law form of
the correlation function, there is more signal on smaller
scales. We also note that the method does an excellent job
of returning a null signal in areas where there is no overlap
between the spectroscopic and photometric samples, e.g. we
see signal consistent with zero and small error bars outside
the bimodal bins in Figure 2.

We can use these features to test for the presence of
interlopers, as done in Morrison et al. (2012), where the
authors cross-correlate a high redshift luminous blue galaxy
sample with spectroscopically confirmed galaxies to test for
the presence of intermediate redshift elliptical galaxies with
similar expected colors. We construct several data sets to
test the influence of recovery scales on sensitivity to outlier
populations. Using the LasDamas mixed bias evolution data
set, we construct samples with a primary peak at
the redshift of interest ($0.4 \leq z \leq 0.6$) and a secondary peak
due to color degeneracies ($0.8 \leq z \leq 1.0$) that contains
between 0.5% and 10% of the total number of galaxies.
Figure 6 shows detection significance (in terms of $\sigma$ determined
from the $\chi^2$) as a function of contamination fraction. The
inset shows one recovered distribution as an example, with
10% contamination and using an annulus of $30 \leq r \leq 300$
kpc.

Using the $300 \leq r \leq 3000$ kpc annulus we cannot
reliably detect the secondary peak; at smaller scales, however,
we see nearly all bins outside of the two peaks consistent
with zero, and clearly detect non-zero signal in the range
$0.8 \leq z \leq 1.0$ for contamination fractions above 2%. The
ability to detect secondary peaks will depend on both the
redshift evolution of the bias and the amount of separation
between the two peaks in redshift space. The influence of the
bias evolution when using small scale information can cause
us to misestimate the overall contamination fraction, though
detection of any contaminants at all may be the goal. While
the recovery method does not directly inform us of which
galaxies are degenerate, we can use the method to tailor
photometric redshift cuts that lead to maximum purity in
the sample by testing variations of the cuts and choosing
those that minimize sample contamination.

4.4 Alternative Bias Removal Technique 

The application of the full iterative procedure discussed so
far requires calculation of the photometric sample angular
autocorrelation functions. In actual surveys, complex
selection and masking often make estimation of the correlation
functions difficult. We can simplify our analysis if, instead
of assuming a linear relation between the spectroscopic and
photometric samples, we assume that the two samples have the
same bias, as calculated from $r_{0,ss}$ (or measurements from
the literature). We take the estimates of $r_{0,ss}$ measured at
$> 300$ kpc, discussed in Section 3.2, and calculate the
Figure 6. Detection significance of the secondary peak as a
function of contamination fraction using the LasDamas mixed bias
data set for three annuli. The inset shows an example recovery
with 10% of the galaxies in the second peak. The higher S/N per
bin for smaller annuli enables us to detect contaminating objects
at much greater significance than when using large scale
information alone.

bias evolution of the spectroscopic sample as a function of
redshift. In place of the full iterative procedure, we then
simply divide our initial estimate of $\phi(z)$ by this relative bias
and renormalize.
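
The divide-and-renormalize step can be sketched as below. This is a minimal illustration under stated assumptions: the relative bias of the spectroscopic sample is taken as an already-computed input (how it is derived from $r_{0,ss}$ is not reproduced here), the bins have equal width, and the distribution is normalized to unit integral. The function and argument names are hypothetical.

```python
def remove_spectroscopic_bias(phi, rel_bias, dz):
    """Divide an initial phi(z) estimate by a relative bias and renormalize.

    phi      : initial recovered amplitudes, one per redshift bin
    rel_bias : relative bias of the spectroscopic sample in each bin
               (assumed precomputed; hypothetical input)
    dz       : common bin width, used for the unit-integral normalization
    """
    corrected = [p / b for p, b in zip(phi, rel_bias)]
    norm = sum(corrected) * dz
    return [c / norm for c in corrected]


if __name__ == "__main__":
    # Illustrative numbers only: the second bin is boosted by a factor of two
    # in bias, so the correction restores equal amplitudes.
    print(remove_spectroscopic_bias([2.0, 4.0], [1.0, 2.0], 0.5))
```

Because only the spectroscopic sample enters the correction, no photometric autocorrelation measurement is needed in this sketch.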

The top panel of Figure 7 shows a comparison between
the initial estimate (black), the final iterative correction
(red), and this alternative bias removal (blue) for the
Millennium light cone simulation with $r_{\rm min} \sim 30$ kpc and
$r_{\rm max} \sim 300$ kpc (though the conclusions hold at both smaller
and larger scales as well). The simple bias correction
actually outperforms the iterative solution, with a $\chi^2 \sim 0.40$,
compared to $\chi^2 \sim 0.61$ for the iterative method. In
retrospect, this is not unexpected: the photometric samples from
the Millennium simulation used in Figure 7 were drawn from
the same underlying population as the spectroscopic sample,
and thus have the same galaxy bias properties. The linear
bias approximation used in the iterative correction, which
calculates the cross-correlation length $r_{0,sp}$ as the geometric mean of
the spectroscopic correlation length and a single, constant
value for the average photometric correlation length,
actually lessens the predicted redshift evolution of the bias,
particularly when using a very wide redshift baseline for the
photometric sample. This is related to the improvements gained
from splitting the sample into subsets in redshift, where we
gain both from a smaller relative evolution in bias over the
shorter redshift interval and from the ability to estimate
multiple values of $r_{0,pp}$ in the different redshift intervals.
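
The damping effect described above follows directly from the geometric mean: with a single constant photometric correlation length, the cross-correlation length evolves only as the square root of the spectroscopic one. The sketch below illustrates this with made-up numbers; the function names are hypothetical and the values are not taken from the simulations.

```python
import math


def r0_cross(r0_ss, r0_pp):
    """Geometric-mean estimate of the cross-correlation length."""
    return math.sqrt(r0_ss * r0_pp)


def cross_evolution(r0_ss_low, r0_ss_high, r0_pp):
    """Ratio of the cross-correlation length between two redshifts when
    the photometric correlation length is held at one constant value."""
    return r0_cross(r0_ss_high, r0_pp) / r0_cross(r0_ss_low, r0_pp)


if __name__ == "__main__":
    # Illustrative: the spectroscopic length grows by a factor 2.25,
    # but the geometric mean grows by only sqrt(2.25) = 1.5.
    print(cross_evolution(4.0, 9.0, 5.0))
```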

Applying a stellar mass selection to the spectroscopic
sample will change the galaxy bias evolution properties. The
bottom panel of Figure 7 shows the recovery of the same
photometric sample using the mass selected spectroscopic
sample and corresponding bias estimate. While the
difference is less pronounced, using the spectroscopic bias to
correct the amplitude again outperforms the iterative method,
with $\chi^2 = 3.4$ versus $\chi^2 = 5.1$ for the iterative method.
In practical terms, removing the bias evolution of the
estimated sample using an estimate based solely on the
spectroscopic sample can provide results competitive with or better
than those obtained from the full iterative technique.

Figure 7. Comparing the iterative and approximate bias removal
techniques for $30 \leq r \leq 300$ kpc. The top panel shows the
recovered distributions for the Millennium sample with evolving bias
using the full iterative technique (red), and using only the
spectroscopic bias (blue). The bottom panel shows the same recovery
using the stellar mass selected spectroscopic sample. The simple
bias approximation can provide as good or better estimates of the
redshift distribution when the bias evolution of the two samples
is similar.



5 DISCUSSION AND FUTURE WORK 

In this paper we have presented a study of a simple but
powerful redshift recovery algorithm applied to realistic mock
datasets, testing the inclusion of information from the
non-linear clustering regime. Our galaxy density estimator is
equivalent to a one-bin measurement of the cross-correlation
function between user-adjustable physical scales $r_{\rm min}$ and
$r_{\rm max}$. We have shown that non-linear scales contain a wealth
of information that can be exploited to increase S/N in the
determination of redshift distributions compared to using
only large scale information, and that, for narrow
distributions, the iterative technique used to mitigate the effects
of bias evolution works well beyond the linear regime used
to date. Due to the wide variety of bias scenarios that may
be present in real data we are limited to somewhat
qualitative assessments. However, these general conclusions are
informative for future applications of the method.

We successfully recover the redshift distributions for
several evolving and non-evolving galaxy bias configurations.
However, the non-linear biasing does incur increasing
amounts of error as we push to smaller and smaller radii.
The optimum scale depends on the details of the photometric
dataset, both in terms of bias properties and extent in
redshift: narrow redshift distributions and those with little
expected bias evolution can exploit clustering signal well
into the non-linear regime, while broad redshift distributions
or those with complex galaxy bias should be restricted to the more
conservative limits at larger scales. Furthermore, in our
one-bin treatment, larger values of $r_{\rm max}$ lead to increasing
covariance between redshift recovery bins, which increases the
relative error when using large scales. One must find the
balance between the increased signal-to-noise and the
accompanying increased sensitivity to galaxy bias when performing
the recovery.

The iterative correction suggested by N08 and
Matthews & Newman (2010) and employed in this paper
has limitations. The assumption that the bias of the photo- 
metric sample scales linearly with the spectroscopic sample 
allows us to determine only a single value for the average 
cluster scaling between the spectroscopic and photometric 
samples via equation O The technique works well at cor- 
recting for galaxy bias when used for large scales, but be- 
gins to fail, as expected, as the non-linear information is 
included. We explored using an approximation of simply dividing
by the bias of the spectroscopic sample in Section 4.4
and found that this works well in many cases, though the 
same caveats that apply to the use of the iterative correc- 
tions apply. The iterative correction works best at the mean 
redshift of the photometric sample, thus narrow redshift dis- 
tributions peaked near the mean redshift are recovered much 
more accurately than broad distributions, as illustrated by 
the relative performance of the Gaussian and bimodal samples
shown in Figures 1 and 2. If we are able to subdivide
the distribution we wish to recover into narrower redshift 
ranges then we can recover the distribution more accurately, 
as the bias should evolve less over the smaller redshift in- 
terval (assuming a smoothly varying bias evolution). This 
was illustrated in the simple example of breaking one of 
our bimodal samples into two bins in Section 4.2. This is
in line with the direction of the large future surveys (DES, 
LSST, etc.), where the strategy for determining cosmolog- 
ical parameters hinges on precisely determining the redshift 
distribution for a number of relatively narrow tomographic 
photometric redshift bins. The tomographic bins planned for 
such surveys are ideal samples for including non-linear infor- 
mation in redshift recovery. However, extra care will have to 
be taken if bins include any "catastrophic outlier" galaxies, 
where photometric redshift degeneracies cause some portion 
of the sample to lie at very different redshifts than that tar- 
geted by the selection. Such distributions will be suscepti- 
ble to biasing, particularly when including information from 
non-linear scales. In such cases, reverting to large $r_{\rm min}$ values
may be necessary. Even in such cases, the non-linear
regime can be used to accurately assess the presence or absence
of catastrophic outliers in the sample, as illustrated in
the tests of sample contamination discussed in Section 4.3.
We plan to carry out tests on more realistic tomographic 
photometric redshift bins based on improved simulations in 
an upcoming paper. 

The method could be further improved by extending 
beyond the simple one-bin treatment used in this paper, 
particularly in cases where there is obvious non-power law 
form to the correlation functions, or where the slope of the 
power law changes substantially. For example, in the deter- 
mination of ro,ss we used only information at scales greater 
than 300 kpc, beyond the break in the correlation function, 
even when testing the recovery at the smallest scales. An 
explicit fit to both the one halo and two halo portions of 
the correlation function would enable a more precise recovery.
However, the non-linear relation between galaxies and
underlying dark matter at small scales will still leave the 
method susceptible to the influence of galaxy bias evolution. 

Having shown that the methods discussed in this pa- 
per can accurately recover redshift distributions using small 
scale clustering information, we will follow up with analyses
using real data sets for both known and wholly novel
redshift distributions (Menard et al., in preparation). This
powerful technique will be an important and useful tool for 
both current and future photometric surveys. 
APPENDIX A: ADDITIONAL RECOVERY PARAMETERS

While the choice of the physical annulus defined by $r_{\rm min}$ and
$r_{\rm max}$ is the dominant factor in determining the redshift
recovery, we have the freedom to choose both an additional
radial weighting of our pixelized aperture and the bin width
of the spectroscopic sample.

A1 Annulus Weighting

The power law form of the correlation function shows that
there is an increasing amount of clustering information at
smaller angular and physical scales; however, there are also
fewer galaxies due to the decreasing area of the annulus.
Similarly, while larger apertures decrease shot noise in
density estimates, they also increase the number of unassociated
galaxies that are included due solely to line of sight
projection. In addition to a "uniform" density estimator,
where we simply divide the number of galaxies within the
pixelized annulus by the area in physical units of $\mathrm{Mpc}^2$, we
test an "inverse" weighted density estimate, calculating the
density in the pixelized annulus and weighting the density
in each pixel by the inverse of the distance from the
spectroscopic object. Errors are a combination of variations due
to large-scale structure (which becomes a more serious
issue for surveys with small areal footprint) and Poisson shot
noise. We estimate errors on the recovery empirically with a
spatial jackknife, which captures both sources of error. For
illustration, Figure A1 shows the uniform vs. inverse weighting
recovered redshift distributions of the constant bias
bimodal sample with a large outer radius of $r_{\rm max} = 3000$ kpc.
There is a clear reduction of error when using the inverse
weighting, which is observed at nearly all scales and in all
samples tested, thus we employ this inverse weighted
estimator throughout the paper. However, because the smaller
scales are now weighted more heavily, this estimator is more
Figure A1. Recovered redshift distribution for the bimodal
sample of galaxies for the constant bias sample for the "uniform"
density weight (left) and "inverse" density weight (right). The
magenta histogram shows the actual redshift distribution of the
photometric sample. The inverse weighting produces smaller
error estimates, but is more sensitive to the effects of non-linear
bias evolution.

sensitive to evolving galaxy bias. Therefore, caution should 
be used when using the inverse weighting at very small scales 
when galaxy bias evolution is known to be large. 
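
The difference between the two estimators can be sketched as follows. This is a minimal illustration, not the pipeline used for the results: it assumes the annulus has already been pixelized into per-pixel galaxy counts, areas, and distances from the spectroscopic object, and it normalizes the inverse-distance weights by their sum, which is an assumption since the text does not state the normalization. All names are hypothetical.

```python
def uniform_density(counts, areas):
    """Total galaxy count divided by total annulus area."""
    return sum(counts) / sum(areas)


def inverse_weighted_density(counts, areas, distances):
    """Per-pixel densities averaged with weights 1 / distance.

    Normalizing by the sum of the weights is an assumption made here
    so that a uniform field returns the same value as uniform_density.
    """
    weights = [1.0 / d for d in distances]
    weighted = sum(w * n / a for w, n, a in zip(weights, counts, areas))
    return weighted / sum(weights)


if __name__ == "__main__":
    areas = [1.0, 1.0]
    distances = [1.0, 2.0]
    # Uniform field: both estimators agree.
    print(uniform_density([2, 2], areas),
          inverse_weighted_density([2, 2], areas, distances))
    # Excess in the inner pixel: the inverse weighting emphasizes it.
    print(uniform_density([4, 2], areas),
          inverse_weighted_density([4, 2], areas, distances))
```

The second case shows why the inverse weighting responds more strongly to small-scale clustering, and hence to any bias evolution that acts on small scales.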

A2 Redshift Binning 

The choice of binning for the spectroscopic sample is an
additional free parameter that we must choose. To construct
our redshift distribution we take each spectroscopic galaxy
and estimate the density of sources within the physical
aperture defined by $r_{\rm min}$ and $r_{\rm max}$. Then, we bin all
spectroscopic objects within a redshift interval $\Delta z$ and take the
mean of the density estimates within each bin to determine
the amplitude of the redshift distribution estimate.
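
The binning step can be sketched as below: each spectroscopic object carries one density estimate, and the amplitude in a redshift bin is the mean of the estimates falling in that bin. The function name, argument names, and toy values are hypothetical; empty bins are returned as `None` rather than zero so that they are not mistaken for a measured null signal.

```python
import math


def bin_density_estimates(redshifts, densities, z_min, z_max, dz):
    """Mean density estimate per redshift bin of width dz.

    redshifts : redshift of each spectroscopic object
    densities : density estimate measured around each object
    """
    nbins = int(round((z_max - z_min) / dz))
    sums = [0.0] * nbins
    counts = [0] * nbins
    for z, d in zip(redshifts, densities):
        i = math.floor((z - z_min) / dz)
        if 0 <= i < nbins:  # objects outside the range are ignored
            sums[i] += d
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    z = [0.21, 0.25, 0.35, 0.55]   # illustrative values only
    rho = [1.0, 3.0, 5.0, 7.0]
    print(bin_density_estimates(z, rho, 0.2, 0.5, 0.1))
```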

Several factors influence the uncertainties resulting
from a specific choice of redshift binning. Errors are a
combination of Poisson fluctuations (i.e. the number of
spectroscopic galaxies included in the bin), sample variance, and
the fractional error in the amplitude of the cross-correlation
function. The sample variance is fixed by large scale
structure and the areal coverage of the survey. The amplitude of
the cross-correlation function depends on the width of the
redshift bin, as using broader redshift bins lowers the
amplitude of the cross-correlation signal. Narrow redshift bins
lead to a stronger cross-correlation signal; however, this must
be balanced against Poisson noise from small samples within
the bin. In practice, the total signal-to-noise is not a strong
function of bin-width choice for small bins. However, the
total signal-to-noise is significantly lower when using a small
number of very broad bins. Using a small number of bins
effectively throws out information unnecessarily.

Figure A2 shows the jackknife error estimates for several
bin size choices using the LasDamas based mock dataset
with no bias evolution and $30 \leq r \leq 300$ kpc. For the
constant bias sample used in Figure A2 the optimal bin width
occurs at $\Delta z \sim 0.005$ with $\sim 200$-$800$ galaxies per redshift
bin used to determine the mean density. We expect adjacent
bins to be increasingly correlated as $\Delta z$ decreases, as shared
Figure A2. The effect of spectroscopic bin size on the recovered
redshift distribution in a data set with no bias evolution. Errors
are a combination of large scale structure fluctuations and Poisson
noise in the average density estimate. The effect of Poisson
fluctuations can be seen as the number of galaxies per bin decreases
at $\Delta z = 0.005$.



large scale structure near the bin boundaries should become 
more important. 

All bins are correlated with each other, as expected,
since the density estimate of background galaxies samples
the distribution over the entire projected redshift range,
with many galaxies falling within the physical annulus
surrounding a spectroscopic object multiple times. This leads
to a distinct correlation matrix structure: a strong diagonal
and all off-diagonal elements correlated at a similar "floor"
level, the amplitude of which is determined by the size of
the annulus and the width of the recovery. A redshift bin
of width $\Delta z = 0.005$ corresponds to $\sim 10$-$20$ Mpc in
comoving distance for the range $0.2 \leq z \leq 1.33$ probed in the
recovery, much larger than the weighted physical distance, so the
correlation matrix shows a strong diagonal component, but
adjacent bins do not show excess correlation compared to
widely separated bins. The projected nature of the
measurement leads to highly correlated bins, and the full covariance
matrix must be used for proper error estimation.
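
The quoted comoving depth of a $\Delta z = 0.005$ bin can be checked with the thin-shell approximation $\Delta\chi \approx c\,\Delta z / H(z)$. The sketch below assumes a flat $\Lambda$CDM cosmology with $H_0 = 70$ km/s/Mpc and $\Omega_m = 0.3$; these parameter values are illustrative assumptions and are not necessarily those of the simulations, and the function name is hypothetical.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_bin_depth(z, dz, h0=70.0, omega_m=0.3):
    """Approximate comoving depth (Mpc) of a thin redshift bin.

    Uses delta_chi = c * dz / H(z) for flat LambdaCDM; h0 and omega_m
    are assumed illustrative values, not taken from the source.
    """
    e_z = math.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))
    return C_KM_S * dz / (h0 * e_z)


if __name__ == "__main__":
    for z in (0.2, 1.33):
        print(z, round(comoving_bin_depth(z, 0.005), 1))
```

Under these assumptions the bin depth falls from roughly 19 Mpc at the low-redshift end to roughly 10 Mpc at the high-redshift end, consistent with the range quoted above.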